use OR(<<, >>) for all rotations #5

Open
oconnor663 wants to merge 1 commit into master
Conversation

oconnor663

Discussion/experiment PR: I was playing around with different rotations, and I found that these simplified ones seem to perform a lot better under Clang/Rustc. Most of the functions improve by about 6%, but BLAKE2b improves by 14%. Have you run into this before? Is it a known Clang vs GCC thing?
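For concreteness, here is a minimal sketch of the two forms being compared, using the 64-bit rotate-right-by-16 from BLAKE2b as the example. The function names are illustrative and not necessarily the ones in this PR's diff:

    #include <immintrin.h>

    /* OR(<<, >>) form: typically compiles to vpsrlq + vpsllq + vpor, or to
       vprorq when the compiler recognizes the rotation idiom on AVX-512. */
    static inline __m256i rotr64_16_shift(__m256i x) {
        return _mm256_or_si256(_mm256_srli_epi64(x, 16),
                               _mm256_slli_epi64(x, 64 - 16));
    }

    /* Byte-shuffle form: a single vpshufb whose constant mask implements a
       16-bit rotate of each 64-bit lane. */
    static inline __m256i rotr64_16_shuffle(__m256i x) {
        const __m256i r16 = _mm256_setr_epi8(
             2,  3,  4,  5,  6,  7,  0,  1, 10, 11, 12, 13, 14, 15,  8,  9,
             2,  3,  4,  5,  6,  7,  0,  1, 10, 11, 12, 13, 14, 15,  8,  9);
        return _mm256_shuffle_epi8(x, r16);
    }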

sneves (Owner) commented Dec 22, 2019

This is entirely dependent on architecture. I imagine you measured this on Skylake or Skylake-X. Also, cycle counts are more useful than percentages for understanding the difference.

Up to and including Haswell, [v]ps{r,l}l{d,q} could only be dispatched to a single execution port. Thus the sequence psrld, pslld, por necessarily took 3 cycles to run, whereas pshufb took 1 cycle (and you could run 2 of them per cycle from the 45nm Core 2 through Sandy Bridge).

With Skylake, [v]ps{r,l}l{d,q} can be dispatched to either of two execution ports, so the sequence psrld, pslld, por has 2 cycles of latency. On Skylake-X, clang will recognize the rotation idiom and replace it with vprord, whereas that will not happen with the shuffles.

In parallel modes, where latency does not matter much but overall throughput does, doing 4 independent rotations with pshufb will cost you exactly 4 cycles of throughput, and psrld, pslld, por will cost the same 4 cycles. The main difference is port pressure, which is concentrated on port 5 with the shuffles and more evenly spread out with the arithmetic.

oconnor663 (Author)

I was measuring this on the Skylake-SP server that AWS gives me, and on my Kaby Lake laptop.

What's the best way to treat these microarchitecture-specific differences? Should we generally optimize for the most modern thing? Is it worth shipping both versions and putting the choice behind a compiler flag?

sneves (Owner) commented Dec 23, 2019

Those are all questions without definitive answers. If you don't mind the maintenance burden, having a version for each major microarchitecture would be the best solution. But in general, the version that works best overall (for some definition of "overall") is preferable from a maintenance standpoint.
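If you did want to ship both, a rough sketch of the compiler-flag approach from the question above might look like the following, shown here for the 32-bit rotate by 16; the macro name BLAKE2_ROT_SHIFT_OR is made up for illustration:

    #include <immintrin.h>

    static inline __m256i rotr32_16(__m256i x) {
    #if defined(BLAKE2_ROT_SHIFT_OR)
        /* OR(<<, >>) form: lower latency on Skylake and later, where the
           vector shifts can go to either of two ports. */
        return _mm256_or_si256(_mm256_srli_epi32(x, 16),
                               _mm256_slli_epi32(x, 16));
    #else
        /* pshufb form: preferable up to Haswell, where the vector shifts
           share a single execution port. */
        const __m256i r16 = _mm256_setr_epi8(
             2,  3,  0,  1,  6,  7,  4,  5, 10, 11,  8,  9, 14, 15, 12, 13,
             2,  3,  0,  1,  6,  7,  4,  5, 10, 11,  8,  9, 14, 15, 12, 13);
        return _mm256_shuffle_epi8(x, r16);
    #endif
    }

A build would then pick one variant with, e.g., -DBLAKE2_ROT_SHIFT_OR on targets where it measures faster.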

sneves (Owner) commented Dec 24, 2019

So this performance win looks more like an LLVM "bug" than an actual optimization. In fact, I believe I had already seen this behavior somewhere before, and then forgot about it.

The real root cause is that, for rotations by 16, LLVM prefers to use the pair vpshuflw ymm0,ymm0,0x39; vpshufhw ymm0,ymm0,0x39 rather than the vpshufb ymm0,ymm0,ymm10 it was instructed to use. With -mavx2 but no explicit -march=... directive, clang defaults to an older instruction scheduling model where pshufb either does not exist or is slower.

And the funny thing is that with your patch, the pshufb for the rotation by 16 is actually generated (in fact, all of the same pshufbs are generated). By decomposing the pshufb into a generic vector shuffle and then reassembling it, LLVM ends up pessimizing the code. Go figure.

So the solution is simple: where Clang/LLVM is concerned, if you are using AVX2, also add -march=haswell, which should select a reasonable instruction scheduling model.
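Concretely, for a standalone clang build of one of the AVX2 files (the file name here is just a placeholder), that means something like:

    # default x86-64 scheduling model: the rotate by 16 may lower to vpshuflw/vpshufhw
    clang -O3 -mavx2 -c blake2b.c

    # Haswell scheduling model: keeps the single vpshufb
    clang -O3 -mavx2 -march=haswell -c blake2b.c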

sneves (Owner) commented Dec 24, 2019

The _mm256_shuffle_epi8 intrinsic appears to be decomposed into a general vector shuffle, which then gets pattern-matched as a 16-lane 16-bit element shuffle, and the general-purpose lowering is matched before the pshufb optimization.

And the reason it's "fixed" with -march=haswell is this:

  // Depth threshold above which we can efficiently use variable mask shuffles.
  int VariableShuffleDepth = Subtarget.hasFastVariableShuffle() ? 2 : 3;
  AllowVariableMask &= (Depth >= VariableShuffleDepth) || HasVariableMask;

So the pair vpshuflw, vpshufhw only gets reoptimized back into a single shuffle if the architecture in question is deemed to have fast variable shuffles, a feature first enabled for Haswell. The default subtarget, however, is based on Sandy Bridge and does not have it. It seems a bit odd to introduce this feature with Haswell; byte shuffles have been single-cycle since the 45nm Core 2.

Anyway, long story short, this is definitely a problem with LLVM.
